Skip to content

Use exact distinct_count from statistics if exists for COUNT(DISTINCT column)) calculations#20845

Merged
alamb merged 4 commits intoapache:mainfrom
buraksenn:use-distinct-count-from-stats-in-distinct-count
Mar 16, 2026
Merged

Use exact distinct_count from statistics if exists for COUNT(DISTINCT column)) calculations#20845
alamb merged 4 commits intoapache:mainfrom
buraksenn:use-distinct-count-from-stats-in-distinct-count

Conversation

@buraksenn
Copy link
Contributor

Which issue does this PR close?

Does not close but part of #20766

Rationale for this change

If distinct_count is known for the column from statistics we can directly use it in COUNT(DISTINCT column)) expressions instead of calculating.

What changes are included in this PR?

  • Return distinct_count value if exists from statistics
  • Unit tests

Are these changes tested?

Added additional tests

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate functions Changes to functions implementation labels Mar 10, 2026
// expressions like casts or literals are not supported.
let col_expr = expr.as_any().downcast_ref::<expressions::Column>()?;
if let Precision::Exact(dc) = col_stats[col_expr.index()].distinct_count {
return Some(ScalarValue::Int64(Some(dc as i64)));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes NULL is not counted in distinct_count column. AFAIS from the repository it is not counted so this should be ok but wanted to put a comment here

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @buraksenn couple comments

Ok(())
}

fn mock_data_with_distinct_count(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to combine these two functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not exactly understand but changed it to one table driven test

// expressions like casts or literals are not supported.
let col_expr = expr.as_any().downcast_ref::<expressions::Column>()?;
if let Precision::Exact(dc) = col_stats[col_expr.index()].distinct_count {
return Some(ScalarValue::Int64(Some(dc as i64)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinct count is usize, we should deal with the case where it could overflow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since other paths were casting as well I did not do it actually but let me do it for all paths

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @buraksenn

@alamb alamb added this pull request to the merge queue Mar 16, 2026
@alamb
Copy link
Contributor

alamb commented Mar 16, 2026

Thanks @xudong963 @buraksenn and @jonathanc-n

Merged via the queue into apache:main with commit 26251bb Mar 16, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants